
Conversation

Angazenn
Contributor

@Angazenn Angazenn commented Sep 29, 2025

What this PR does / why we need it?

This PR adds padding logic for seq_lens and block_tables when running in a full decode scenario. Before this PR, the number of input tokens after padding could exceed what the corresponding seq_lens describes. For example, in a full decode scenario:

input_ids : [1, 3, 0, 0]
seq_lens: [2, 1]
query_start_loc: [0, 1, 2]

Here, input_ids is padded by 2 tokens while seq_lens/query_start_loc are not. The mismatch between input_ids and seq_lens/query_start_loc may cause subtle bugs. This PR changes it into:

input_ids : [1, 3, 0, 0]
seq_lens: [2, 1, 1, 1]
query_start_loc: [0, 1, 2, 3, 4]
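
A minimal sketch of this padding, assuming PyTorch tensors and the made-up values from the example above (the real change lives in vllm_ascend/attention/attention_v1.py); each padded token is treated as a dummy decode request of sequence length 1:

    import torch

    # Illustrative values from the example above (not the real metadata objects).
    seq_lens = torch.tensor([2, 1], dtype=torch.int32)
    query_start_loc = torch.tensor([0, 1, 2], dtype=torch.int32)
    graph_pad_size = 4      # length of the padded input_ids
    num_actual_tokens = 2   # tokens produced by the real decode requests

    padded_num_tokens = graph_pad_size - num_actual_tokens

    # Each padded token becomes a dummy decode request of sequence length 1 ...
    seq_lens = torch.cat([
        seq_lens,
        torch.ones(padded_num_tokens, dtype=seq_lens.dtype)
    ])
    # ... and query_start_loc gains one entry per dummy request, each advancing by 1.
    start = int(query_start_loc[-1]) + 1
    query_start_loc = torch.cat([
        query_start_loc,
        torch.arange(start, start + padded_num_tokens, dtype=query_start_loc.dtype)
    ])

    print(seq_lens.tolist())         # [2, 1, 1, 1]
    print(query_start_loc.tolist())  # [0, 1, 2, 3, 4]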

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Signed-off-by: Angazenn <[email protected]>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description and write a clear commit message to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces padding logic for attention metadata, which is necessary for ACL graph capture with dynamic batch sizes. The changes in vllm_ascend/worker/model_runner_v1.py correctly control when this padding is applied. However, I've found a couple of critical issues in the padding implementation in vllm_ascend/attention/attention_v1.py. Specifically, there's an off-by-one error when padding query_start_loc_cpu, and query_lens is not being padded at all, which will lead to tensor shape mismatches and incorrect attention computation. I've provided a code suggestion to fix these issues.

Comment on lines 212 to 220
            padded_num_tokens = common_attn_metadata.graph_pad_size - num_actual_tokens
            seq_lens = torch.cat([seq_lens, torch.ones(padded_num_tokens, dtype=seq_lens.dtype, device=seq_lens.device)])
            block_table_padding = torch.zeros(
                (padded_num_tokens, ) + block_table.shape[1:],
                dtype=block_table.dtype,
                device=block_table.device)
            block_table = torch.cat([block_table, block_table_padding],
                                    dim=0)
            query_start_loc_cpu = torch.cat([query_start_loc_cpu, torch.arange(query_start_loc_cpu[-1] + 1, query_start_loc_cpu[-1] + padded_num_tokens, dtype=query_start_loc_cpu.dtype, device=query_start_loc_cpu.device)])
Contributor


critical

There are two issues in this padding logic that will cause problems during execution:

  1. query_start_loc_cpu padding is incorrect: The call to torch.arange has an off-by-one error. The end parameter is exclusive, so torch.arange(start, end) generates end - start elements. Your code generates padded_num_tokens - 1 elements, but you need to pad padded_num_tokens elements to match the padded seq_lens. This will cause a tensor size mismatch. The end argument should be query_start_loc_cpu[-1] + padded_num_tokens + 1.

  2. query_lens is not padded: You are padding seq_lens, block_table, and query_start_loc_cpu, but query_lens (calculated before this block) is not updated. This will lead to a mismatch in the number of sequences between query_lens and other padded tensors when creating AscendMetadata, causing errors downstream where query_lens is used. query_lens should also be padded with ones for the new sequences.

I've provided a suggestion to fix both issues.

            padded_num_tokens = common_attn_metadata.graph_pad_size - num_actual_tokens
            seq_lens = torch.cat([seq_lens, torch.ones(padded_num_tokens, dtype=seq_lens.dtype, device=seq_lens.device)])
            query_lens = torch.cat([query_lens, torch.ones(padded_num_tokens, dtype=query_lens.dtype, device=query_lens.device)])
            block_table_padding = torch.zeros(
                (padded_num_tokens, ) + block_table.shape[1:],
                dtype=block_table.dtype,
                device=block_table.device)
            block_table = torch.cat([block_table, block_table_padding],
                                    dim=0)
            new_query_locs = torch.arange(
                query_start_loc_cpu[-1] + 1,
                query_start_loc_cpu[-1] + padded_num_tokens + 1,
                dtype=query_start_loc_cpu.dtype,
                device=query_start_loc_cpu.device)
            query_start_loc_cpu = torch.cat([query_start_loc_cpu, new_query_locs])
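
As a quick check of the off-by-one described above (an illustrative snippet with made-up values, separate from the suggestion itself): torch.arange's end bound is exclusive, so arange(start, end) yields end - start elements.

    import torch

    query_start_loc_cpu = torch.tensor([0, 1, 2], dtype=torch.int32)  # made-up example values
    padded_num_tokens = 2
    start = int(query_start_loc_cpu[-1]) + 1

    # The original bound produces one element too few; adding 1 to the end fixes it.
    original = torch.arange(start, start + padded_num_tokens - 1)  # tensor([3]): one element short
    fixed = torch.arange(start, start + padded_num_tokens)         # tensor([3, 4]): padded_num_tokens elements

    assert fixed.numel() == padded_num_tokens
    # With the fix, query_start_loc keeps its usual invariant: one more entry than seq_lens.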

Signed-off-by: Angazenn <[email protected]>